Assessing Urban Livability: A Multifactor Analysis of Rent, Noise and Crime in New York City

Group 6 (Stat Squad: Abhay Shah, David Skorodinsky, Javier Miguelez, Sarah Doctor, Tian Lan)

The objective of this study is to identify neighborhoods in New York City that offer a high quality of life, taking into account rent prices, the variation in noise complaints, and crime rates across different areas. Analyzing these factors together lets us characterize the livability of each area. This study will benefit potential renters deciding where to move within NYC and policymakers looking to allocate resources effectively across the city’s diverse neighborhoods.

The research will be guided by the following questions:

  1. How are rent prices expected to change in the next two years across New York City?

  2. How do the noise complaint types vary across neighborhoods in New York City?

  3. How do the crimes vary across neighborhoods in New York City?

Datasets and Pre-processing

1.1 The Noise Dataset

Data Source: https://data.cityofnewyork.us/Social-Services/311-Noise-Complaints/p5f6-bkga/about_data

raw_noise = read.csv("/Users/tianlanmac16/Desktop/Columbia AA/5205-R/StatSquad/NOISE/RAWNOISE_311_Noise_Complaints_20240222.csv")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(skimr)

1.1.1 Cleaning

To obtain summarized, quantitative data from the raw noise dataset, we will extract the following key columns to analyze and aggregate:

  - Unique.Key: unique identifier of each complaint
  - Incident.Zip: zip code of the incident location
  - Community.Board: community board (matches to a precinct) for locating the incident
  - Borough: borough name
  - Created.Date: exact date of occurrence for the reported event
  - Complaint.Type: category of complaint (Residential, Commercial, etc.)
  - Descriptor: detailed description of the noise category
  - Latitude, Longitude: coordinates of the incident

noise_subset = raw_noise %>%
  select(Unique.Key, Created.Date, Community.Board, Borough, Incident.Zip, Incident.Address, Complaint.Type, Descriptor, Latitude, Longitude)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
noise_subset$Created.Date = mdy_hms(noise_subset$Created.Date)
filtered_noise_subset <- noise_subset %>%
  filter(year(Created.Date) >= 2012 & year(Created.Date) <= 2022)
library(dplyr)
library(stringr)

imputed_noise <- filtered_noise_subset %>%
    mutate(Borough = str_replace(Borough, "^$", "Unspecified"),
           Community.Board = str_replace(Community.Board, "^$", "0 Unspecified"))
skim(imputed_noise)
Data summary
Name imputed_noise
Number of rows 5309712
Number of columns 10
_______________________
Column type frequency:
character 5
numeric 4
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Community.Board 0 1 8 25 0 77 0
Borough 0 1 5 13 0 6 0
Incident.Address 0 1 0 59 338563 533660 0
Complaint.Type 0 1 5 24 0 10 0
Descriptor 0 1 0 71 2 35 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Unique.Key 0 1.00 41413536.15 9743397.46 22425840.00 33219755.50 42596970.50 50390328.25 56556254.00 ▅▆▅▆▇
Incident.Zip 10712 1.00 10694.17 556.46 0.00 10036.00 10467.00 11225.00 12345.00 ▁▁▁▁▇
Latitude 32594 0.99 40.76 0.08 40.50 40.69 40.75 40.83 40.91 ▁▃▇▆▆
Longitude 32594 0.99 -73.92 0.07 -74.26 -73.96 -73.93 -73.88 -73.70 ▁▁▇▆▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
Created.Date 0 1 2012-01-01 00:00:11 2022-12-31 23:59:13 2019-05-06 01:05:41 5141605
unique(imputed_noise$Borough)
## [1] "MANHATTAN"     "BRONX"         "QUEENS"        "BROOKLYN"     
## [5] "STATEN ISLAND" "Unspecified"
unique(imputed_noise$Community.Board)
##  [1] "07 MANHATTAN"              "11 BRONX"                 
##  [3] "12 QUEENS"                 "02 QUEENS"                
##  [5] "11 QUEENS"                 "07 BROOKLYN"              
##  [7] "10 MANHATTAN"              "04 BRONX"                 
##  [9] "09 QUEENS"                 "03 QUEENS"                
## [11] "02 BRONX"                  "03 BROOKLYN"              
## [13] "05 BRONX"                  "04 BROOKLYN"              
## [15] "09 BRONX"                  "06 BRONX"                 
## [17] "01 BROOKLYN"               "04 QUEENS"                
## [19] "06 MANHATTAN"              "01 STATEN ISLAND"         
## [21] "12 MANHATTAN"              "02 BROOKLYN"              
## [23] "08 BROOKLYN"               "01 MANHATTAN"             
## [25] "12 BRONX"                  "16 BROOKLYN"              
## [27] "09 MANHATTAN"              "12 BROOKLYN"              
## [29] "09 BROOKLYN"               "05 QUEENS"                
## [31] "07 BRONX"                  "06 BROOKLYN"              
## [33] "13 QUEENS"                 "08 BRONX"                 
## [35] "01 QUEENS"                 "05 MANHATTAN"             
## [37] "05 BROOKLYN"               "11 BROOKLYN"              
## [39] "01 BRONX"                  "02 MANHATTAN"             
## [41] "07 QUEENS"                 "11 MANHATTAN"             
## [43] "02 STATEN ISLAND"          "17 BROOKLYN"              
## [45] "10 BROOKLYN"               "08 MANHATTAN"             
## [47] "04 MANHATTAN"              "03 BRONX"                 
## [49] "18 BROOKLYN"               "14 BROOKLYN"              
## [51] "10 BRONX"                  "03 MANHATTAN"             
## [53] "14 QUEENS"                 "08 QUEENS"                
## [55] "03 STATEN ISLAND"          "10 QUEENS"                
## [57] "81 QUEENS"                 "15 BROOKLYN"              
## [59] "13 BROOKLYN"               "06 QUEENS"                
## [61] "Unspecified MANHATTAN"     "55 BROOKLYN"              
## [63] "64 MANHATTAN"              "Unspecified BRONX"        
## [65] "0 Unspecified"             "95 STATEN ISLAND"         
## [67] "27 BRONX"                  "Unspecified BROOKLYN"     
## [69] "Unspecified QUEENS"        "Unspecified STATEN ISLAND"
## [71] "82 QUEENS"                 "84 QUEENS"                
## [73] "28 BRONX"                  "80 QUEENS"                
## [75] "26 BRONX"                  "83 QUEENS"                
## [77] "56 BROOKLYN"

Due to limited time and computing power, instead of recovering the unspecified areas and zip codes, we will delete the records with missing values, which make up approximately 0.5% of the raw data.

cleaned_noise = imputed_noise %>%
  filter(!(str_detect(Community.Board, "^0 Unspecified"))) %>%
  filter(!(str_detect(Community.Board, "^Unspecified"))) %>% 
  filter(!(str_detect(Descriptor, "^\\s*$"))) 

cleaned_noise = cleaned_noise[!is.na(cleaned_noise$Incident.Zip), ]
nrow(cleaned_noise)
## [1] 5279309

After this reduction, we retain 99.43% of the raw Noise dataset for the quantitative summary.

sprintf("%.10f", 5279309/5309712) # we kept 99.43% of the raw data 
## [1] "0.9942740774"
cleaned_noise$Incident.Zip <- as.character(cleaned_noise$Incident.Zip)
cleaned_noise_final <- cleaned_noise %>%
  mutate(Borough = ifelse(Borough == 'MANHATTAN' & Community.Board == '08 BRONX', 'BRONX', Borough)) %>%
  mutate(Borough = ifelse(Borough == 'BRONX' & Community.Board == '01 QUEENS', 'QUEENS', Borough)) 

cleaned_noise_final %>%
  filter(Borough == 'MANHATTAN')%>%
  group_by(Community.Board) %>%
  summarise(complaints = n()) 
## # A tibble: 13 × 2
##    Community.Board complaints
##    <chr>                <int>
##  1 01 MANHATTAN         56867
##  2 02 MANHATTAN         96231
##  3 03 MANHATTAN        171221
##  4 04 MANHATTAN        109874
##  5 05 MANHATTAN         72173
##  6 06 MANHATTAN         79380
##  7 07 MANHATTAN        136185
##  8 08 MANHATTAN         86978
##  9 09 MANHATTAN        132066
## 10 10 MANHATTAN        167450
## 11 11 MANHATTAN        100502
## 12 12 MANHATTAN        316333
## 13 64 MANHATTAN          3118

1.1.2 Summary

noise_summary<- cleaned_noise_final%>%
  mutate(year = year(Created.Date), 
         month = month(Created.Date)) %>% 
  group_by(Borough, Community.Board, Incident.Zip, Complaint.Type, Descriptor, year, month) %>%
  summarise(complaints = n()) 
## `summarise()` has grouped output by 'Borough', 'Community.Board',
## 'Incident.Zip', 'Complaint.Type', 'Descriptor', 'year'. You can override using
## the `.groups` argument.
noise_summary <- noise_summary %>%
  arrange(Incident.Zip, year, month, Complaint.Type, Descriptor)
write.csv(noise_summary, 'noise_summary.csv',row.names = F)
write.csv(cleaned_noise_final, 'cleaned_noise_final.csv',row.names = F)

1.2 The Crime Dataset

Data Source: https://www.nyc.gov/site/nypd/stats/crime-statistics/borough-and-precinct-crime-stats.page#manhattan

1.2.1 Cleaning

library(dplyr)
library(skimr)
raw_crime = read.csv("/Users/tianlanmac16/Desktop/Columbia AA/5205-R/StatSquad/RECLEAN/RAW_NYPD_Complaint_Data 2012-2022.csv")
skim(raw_crime)
Data summary
Name raw_crime
Number of rows 11405
Number of columns 36
_______________________
Column type frequency:
character 26
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
CMPLNT_NUM 0 1 9 14 0 11405 0
BORO_NM 0 1 5 13 0 6 0
CMPLNT_FR_DT 0 1 10 10 0 1478 0
CMPLNT_FR_TM 0 1 8 8 0 1007 0
CMPLNT_TO_DT 0 1 0 10 1175 1356 0
CMPLNT_TO_TM 0 1 6 8 0 990 0
CRM_ATPT_CPTD_CD 0 1 9 9 0 2 0
HADEVELOPT 0 1 4 15 0 14 0
JURIS_DESC 0 1 5 24 0 16 0
LAW_CAT_CD 0 1 6 11 0 3 0
LOC_OF_OCCUR_DESC 0 1 6 11 0 6 0
OFNS_DESC 0 1 4 36 0 45 0
PARKS_NM 0 1 6 30 0 30 0
PATROL_BORO 0 1 17 25 0 8 0
PD_DESC 0 1 6 71 0 218 0
PREM_TYP_DESC 0 1 3 28 0 76 0
RPT_DT 0 1 10 10 0 363 0
STATION_NAME 0 1 6 30 0 66 0
SUSP_AGE_GROUP 0 1 3 7 0 9 0
SUSP_RACE 0 1 5 30 0 8 0
SUSP_SEX 0 1 1 6 0 4 0
VIC_AGE_GROUP 0 1 3 7 0 8 0
VIC_RACE 0 1 5 30 0 8 0
VIC_SEX 0 1 1 1 0 5 0
Lat_Lon 0 1 0 40 1 7729 0
New.Georeferenced.Column 0 1 0 45 1 7729 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ADDR_PCT_CD 4 1.00 65.93 35.01 1.00 41.00 67.00 103.00 123.00 ▅▆▆▃▇
HOUSING_PSA 10760 0.06 7529.66 14632.45 218.00 463.00 696.00 1291.00 47683.00 ▇▁▁▁▁
JURISDICTION_CODE 0 1.00 1.49 10.08 0.00 0.00 0.00 0.00 97.00 ▇▁▁▁▁
KY_CD 0 1.00 247.88 149.82 101.00 109.00 233.00 341.00 678.00 ▇▁▆▁▂
PD_CD 18 1.00 424.82 193.12 101.00 321.00 407.00 638.00 922.00 ▅▇▃▆▁
TRANSIT_DISTRICT 11223 0.02 12.23 11.10 1.00 3.00 11.00 20.00 34.00 ▇▆▁▁▃
X_COORD_CD 1 1.00 1005269.21 22415.07 913511.00 991626.50 1004593.00 1017933.00 1066893.00 ▁▁▇▆▂
Y_COORD_CD 1 1.00 206258.29 30351.98 123626.00 183890.00 205330.00 232216.75 271303.00 ▁▅▇▆▃
Latitude 1 1.00 40.73 0.08 40.51 40.67 40.73 40.80 40.91 ▁▅▇▆▃
Longitude 1 1.00 -73.92 0.08 -74.25 -73.97 -73.93 -73.88 -73.70 ▁▁▇▆▂

To obtain summarized, quantitative data from the raw crime dataset, we will extract the following key columns to analyze and aggregate:

  - CMPLNT_NUM: unique identifier of each case
  - ADDR_PCT_CD: the precinct in which the incident occurred
  - BORO_NM: borough name
  - CMPLNT_FR_DT: exact date of occurrence for the reported event
  - LAW_CAT_CD: level of offense (felony, misdemeanor, violation)
  - OFNS_DESC: description of the offense corresponding with the key code
  - Latitude, Longitude: coordinates of the incident

With the above columns, we will be able to summarize total crime counts each month by location and crime type:

crime_subset = raw_crime %>%
  select(CMPLNT_NUM, CMPLNT_FR_DT, BORO_NM, ADDR_PCT_CD, LAW_CAT_CD, OFNS_DESC, Latitude, Longitude) 

crime_subset %>%
  filter(is.na(ADDR_PCT_CD))
##    CMPLNT_NUM CMPLNT_FR_DT  BORO_NM ADDR_PCT_CD LAW_CAT_CD
## 1 271421229H1   07/05/2020 BROOKLYN          NA     FELONY
## 2 270324746H1   08/06/2022    BRONX          NA     FELONY
## 3 270325369H1   07/22/2022    BRONX          NA     FELONY
## 4 268272009H1   01/02/2022   QUEENS          NA     FELONY
##                         OFNS_DESC Latitude Longitude
## 1 MURDER & NON-NEGL. MANSLAUGHTER       NA        NA
## 2 MURDER & NON-NEGL. MANSLAUGHTER 40.87584 -73.90313
## 3 MURDER & NON-NEGL. MANSLAUGHTER 40.87366 -73.89837
## 4 MURDER & NON-NEGL. MANSLAUGHTER 40.68067 -73.84272

Based on the longitude and latitude data, we can identify the precinct for each of these records and will impute as below:

  - 270325369H1: precinct 50 (Bronx)
  - 270324746H1: precinct 50 (Bronx)
  - 268272009H1: precinct 106 (Queens)
  - 271421229H1: precinct 67 (Brooklyn)

crime_subset <- crime_subset %>%
  mutate(ADDR_PCT_CD = case_when(
    CMPLNT_NUM == "270325369H1" ~ 50,
    CMPLNT_NUM == "270324746H1" ~ 50,
    CMPLNT_NUM == "268272009H1" ~ 106,
    CMPLNT_NUM == "271421229H1" ~ 67,
    TRUE ~ ADDR_PCT_CD
  ))
#cross checking imputed data
crime_subset %>%
  filter(CMPLNT_NUM %in% c("270325369H1", "270324746H1", "268272009H1", "271421229H1"))
##    CMPLNT_NUM CMPLNT_FR_DT  BORO_NM ADDR_PCT_CD LAW_CAT_CD
## 1 271421229H1   07/05/2020 BROOKLYN          67     FELONY
## 2 270324746H1   08/06/2022    BRONX          50     FELONY
## 3 270325369H1   07/22/2022    BRONX          50     FELONY
## 4 268272009H1   01/02/2022   QUEENS         106     FELONY
##                         OFNS_DESC Latitude Longitude
## 1 MURDER & NON-NEGL. MANSLAUGHTER       NA        NA
## 2 MURDER & NON-NEGL. MANSLAUGHTER 40.87584 -73.90313
## 3 MURDER & NON-NEGL. MANSLAUGHTER 40.87366 -73.89837
## 4 MURDER & NON-NEGL. MANSLAUGHTER 40.68067 -73.84272
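As a sanity check on this kind of manual imputation, a coordinate can also be matched to a precinct programmatically. The sketch below is a minimal illustration using a nearest-centroid lookup; the centroid coordinates here are hypothetical placeholders, not official NYPD data (a point-in-polygon join against real precinct boundary shapefiles, e.g. with the sf package, would be more accurate):

```r
# Hypothetical precinct centroids, for illustration only
precinct_centroids <- data.frame(
  precinct = c(50, 67, 106),
  lat = c(40.874, 40.650, 40.681),
  lon = c(-73.900, -73.945, -73.843)
)

# Return the precinct whose centroid is closest to (lat, lon)
nearest_precinct <- function(lat, lon, centroids) {
  d2 <- (centroids$lat - lat)^2 + (centroids$lon - lon)^2
  centroids$precinct[which.min(d2)]
}

# e.g. the coordinates of complaint 268272009H1
nearest_precinct(40.68067, -73.84272, precinct_centroids)
# returns 106 with these hypothetical centroids
```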

Now, assign each record's borough by precinct:

crime_subset <- crime_subset %>%
  mutate(BORO_NM = case_when(
    ADDR_PCT_CD >= 1 & ADDR_PCT_CD <= 35 ~ "MANHATTAN",
    ADDR_PCT_CD >= 40 & ADDR_PCT_CD <= 52 ~ "BRONX",
    ADDR_PCT_CD >= 60 & ADDR_PCT_CD <= 94 ~ "BROOKLYN",
    ADDR_PCT_CD >= 100 & ADDR_PCT_CD <= 115 ~ "QUEENS",
    ADDR_PCT_CD >= 120 & ADDR_PCT_CD <= 125 ~ "STATEN ISLAND",
    TRUE ~ BORO_NM
  ))
unique(crime_subset$BORO_NM)
## [1] "BROOKLYN"      "STATEN ISLAND" "MANHATTAN"     "QUEENS"       
## [5] "BRONX"

Based on skimming the data, all four missing values in OFNS_DESC correspond to OBSCENITY offenses according to the PD_DESC column, so we impute them with the existing "OBSCENIT" category:

crime_subset <- crime_subset %>%
  mutate(OFNS_DESC = ifelse(OFNS_DESC == "(null)", "OBSCENIT", OFNS_DESC))
unique(crime_subset$OFNS_DESC)
##  [1] "MURDER & NON-NEGL. MANSLAUGHTER"     
##  [2] "THEFT-FRAUD"                         
##  [3] "SEX CRIMES"                          
##  [4] "PETIT LARCENY"                       
##  [5] "GRAND LARCENY"                       
##  [6] "HARRASSMENT 2"                       
##  [7] "OFF. AGNST PUB ORD SENSBLTY &"       
##  [8] "FRAUDS"                              
##  [9] "MISCELLANEOUS PENAL LAW"             
## [10] "CRIMINAL MISCHIEF & RELATED OF"      
## [11] "GRAND LARCENY OF MOTOR VEHICLE"      
## [12] "ASSAULT 3 & RELATED OFFENSES"        
## [13] "VEHICLE AND TRAFFIC LAWS"            
## [14] "OTHER OFFENSES RELATED TO THEF"      
## [15] "OTHER STATE LAWS (NON PENAL LA"      
## [16] "RAPE"                                
## [17] "OBSCENIT"                            
## [18] "BURGLARY"                            
## [19] "UNAUTHORIZED USE OF A VEHICLE"       
## [20] "ROBBERY"                             
## [21] "FELONY ASSAULT"                      
## [22] "DANGEROUS DRUGS"                     
## [23] "POSSESSION OF STOLEN PROPERTY"       
## [24] "NYS LAWS-UNCLASSIFIED FELONY"        
## [25] "FORGERY"                             
## [26] "CRIMINAL TRESPASS"                   
## [27] "OFFENSES AGAINST PUBLIC ADMINI"      
## [28] "ARSON"                               
## [29] "OFFENSES INVOLVING FRAUD"            
## [30] "ANTICIPATORY OFFENSES"               
## [31] "OFFENSES AGAINST THE PERSON"         
## [32] "DANGEROUS WEAPONS"                   
## [33] "AGRICULTURE & MRKTS LAW-UNCLASSIFIED"
## [34] "PROSTITUTION & RELATED OFFENSES"     
## [35] "ADMINISTRATIVE CODE"                 
## [36] "THEFT OF SERVICES"                   
## [37] "JOSTLING"                            
## [38] "INTOXICATED & IMPAIRED DRIVING"      
## [39] "KIDNAPPING & RELATED OFFENSES"       
## [40] "OFFENSES AGAINST PUBLIC SAFETY"      
## [41] "HOMICIDE-NEGLIGENT,UNCLASSIFIE"      
## [42] "OFFENSES RELATED TO CHILDREN"        
## [43] "FRAUDULENT ACCOSTING"                
## [44] "OTHER STATE LAWS"                    
## [45] "PETIT LARCENY OF MOTOR VEHICLE"
library(lubridate)
crime_subset$CMPLNT_FR_DT = mdy(crime_subset$CMPLNT_FR_DT)

Based on the cleaned file, there is only one record without latitude and longitude; we will impute this cell with mice.

skim(crime_subset)
Data summary
Name crime_subset
Number of rows 11405
Number of columns 8
_______________________
Column type frequency:
character 4
Date 1
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
CMPLNT_NUM 0 1 9 14 0 11405 0
BORO_NM 0 1 5 13 0 5 0
LAW_CAT_CD 0 1 6 11 0 3 0
OFNS_DESC 0 1 4 36 0 45 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
CMPLNT_FR_DT 0 1 2012-01-01 2022-12-31 2022-10-11 1478

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ADDR_PCT_CD 0 1 65.93 35.00 1.00 41.00 67.00 103.00 123.00 ▅▆▆▃▇
Latitude 1 1 40.73 0.08 40.51 40.67 40.73 40.80 40.91 ▁▅▇▆▃
Longitude 1 1 -73.92 0.08 -74.25 -73.97 -73.93 -73.88 -73.70 ▁▁▇▆▂

1.2.2 Mice Impute

Since the crime dataset will be used for spatial analysis, we need to impute this one missing longitude/latitude pair:

library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
mice_crime = mice::complete(mice(crime_subset,seed = 617))
## 
##  iter imp variable
##   1   1  Latitude  Longitude
##   1   2  Latitude  Longitude
##   1   3  Latitude  Longitude
##   1   4  Latitude  Longitude
##   1   5  Latitude  Longitude
##   2   1  Latitude  Longitude
##   2   2  Latitude  Longitude
##   2   3  Latitude  Longitude
##   2   4  Latitude  Longitude
##   2   5  Latitude  Longitude
##   3   1  Latitude  Longitude
##   3   2  Latitude  Longitude
##   3   3  Latitude  Longitude
##   3   4  Latitude  Longitude
##   3   5  Latitude  Longitude
##   4   1  Latitude  Longitude
##   4   2  Latitude  Longitude
##   4   3  Latitude  Longitude
##   4   4  Latitude  Longitude
##   4   5  Latitude  Longitude
##   5   1  Latitude  Longitude
##   5   2  Latitude  Longitude
##   5   3  Latitude  Longitude
##   5   4  Latitude  Longitude
##   5   5  Latitude  Longitude
## Warning: Number of logged events: 4
write.csv(mice_crime, 'crime_cleaned_for_Javier.csv',row.names = F)

1.2.3 Summary

#by borough - precinct- law- offense type- year- months
crime_summary<- mice_crime %>%
  mutate(year = year(CMPLNT_FR_DT), 
         month = month(CMPLNT_FR_DT)) %>% 
  group_by(BORO_NM, ADDR_PCT_CD , LAW_CAT_CD, OFNS_DESC, year, month) %>%
  summarise(complaints = n()) 
## `summarise()` has grouped output by 'BORO_NM', 'ADDR_PCT_CD', 'LAW_CAT_CD',
## 'OFNS_DESC', 'year'. You can override using the `.groups` argument.
crime_summary <- crime_summary %>%
  arrange(ADDR_PCT_CD, year, month, LAW_CAT_CD, OFNS_DESC)

1.3 The Rent Dataset

Data Source: https://streeteasy.com/blog/data-dashboard

1.3.1 Cleaning

library(dplyr)
library(tidyr)
rental_data <- read.csv("medianAskingRent_All.csv")
#Clean Data
cleaned_rentaldata <- rental_data |>
  pivot_longer(cols = 4:172, names_to = "Date", values_to = "median_rental_price") |> #Pivot Longer
  separate_wider_delim(Date, delim = ".",names = c("Year", "Month")) #Split Date
cleaned_rentaldata$Year <- gsub("X","",cleaned_rentaldata$Year) #Remove X value from Year
#Filter Data
cleaned_rentaldata <- cleaned_rentaldata |>
  filter(Year %in% 
           c("2010", "2011","2012","2013","2014","2015","2016","2017",
             "2018","2019","2020","2021","2022","2023")) 
nrow(cleaned_rentaldata)
## [1] 33264
city_borough = cleaned_rentaldata %>%
  filter(areaName %in% 
           c("NYC", "Manhattan", "Brooklyn", "Queens", "Bronx", "Staten Island"))
city_borough %>%
  filter(is.na(median_rental_price))
## # A tibble: 29 × 6
##    areaName      Borough       areaType Year  Month median_rental_price
##    <chr>         <chr>         <chr>    <chr> <chr>               <dbl>
##  1 Staten Island Staten Island borough  2010  01                     NA
##  2 Staten Island Staten Island borough  2010  02                     NA
##  3 Staten Island Staten Island borough  2010  03                     NA
##  4 Staten Island Staten Island borough  2010  04                     NA
##  5 Staten Island Staten Island borough  2010  05                     NA
##  6 Staten Island Staten Island borough  2010  06                     NA
##  7 Staten Island Staten Island borough  2010  07                     NA
##  8 Staten Island Staten Island borough  2010  08                     NA
##  9 Staten Island Staten Island borough  2010  09                     NA
## 10 Staten Island Staten Island borough  2010  10                     NA
## # ℹ 19 more rows
city_borough <- city_borough %>%
  mutate(Borough = case_when(
    areaName == "NYC" ~ "NYC",
    TRUE ~ Borough
  ))
Staten_rent = city_borough %>%
  filter(areaName == "Staten Island")
Staten_rent$Year <- as.integer(Staten_rent$Year)
Staten_rent$Month <- as.integer(Staten_rent$Month)

skim(Staten_rent)
Data summary
Name Staten_rent
Number of rows 168
Number of columns 6
_______________________
Column type frequency:
character 3
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
areaName 0 1 13 13 0 1 0
Borough 0 1 13 13 0 1 0
areaType 0 1 7 7 0 1 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1.00 2016.50 4.04 2010 2013.00 2016.5 2020.00 2023 ▇▇▅▇▇
Month 0 1.00 6.50 3.46 1 3.75 6.5 9.25 12 ▇▅▅▅▇
median_rental_price 29 0.83 1879.73 232.96 1175 1800.00 1932.0 2000.00 2325 ▁▁▃▇▂

1.3.2 Mice Impute

library(mice)
library(randomForest)
library(ranger)
## 
## Attaching package: 'ranger'
## The following object is masked from 'package:randomForest':
## 
##     importance
mice_staten_rent = mice::complete(mice(Staten_rent,method = "rf", seed = 617, rfPackage='randomForest')) 
## 
##  iter imp variable
##   1   1  median_rental_price
##   1   2  median_rental_price
##   1   3  median_rental_price
##   1   4  median_rental_price
##   1   5  median_rental_price
##   2   1  median_rental_price
##   2   2  median_rental_price
##   2   3  median_rental_price
##   2   4  median_rental_price
##   2   5  median_rental_price
##   3   1  median_rental_price
##   3   2  median_rental_price
##   3   3  median_rental_price
##   3   4  median_rental_price
##   3   5  median_rental_price
##   4   1  median_rental_price
##   4   2  median_rental_price
##   4   3  median_rental_price
##   4   4  median_rental_price
##   4   5  median_rental_price
##   5   1  median_rental_price
##   5   2  median_rental_price
##   5   3  median_rental_price
##   5   4  median_rental_price
##   5   5  median_rental_price
## Warning: Number of logged events: 3
city_borough$Year <- as.integer(city_borough$Year)
city_borough$Month <- as.integer(city_borough$Month)

city_borough = city_borough %>%
  filter(!(areaName == "Staten Island"))
city_borough <- rbind(city_borough, mice_staten_rent)
skim(city_borough)
Data summary
Name city_borough
Number of rows 1008
Number of columns 6
_______________________
Column type frequency:
character 3
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
areaName 0 1 3 13 0 6 0
Borough 0 1 3 13 0 6 0
areaType 0 1 4 7 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 2016.50 4.03 2010 2013.00 2016.5 2020.00 2023 ▇▇▅▇▇
Month 0 1 6.50 3.45 1 3.75 6.5 9.25 12 ▇▅▅▅▇
median_rental_price 0 1 2449.88 640.01 1175 1966.00 2450.0 2900.00 4395 ▅▇▇▃▁

The CSV for the time-series analysis is now cleaned and ready to export:

write.csv(city_borough, 'city_borough_rent.csv',row.names = F)

H1 Time Series Analysis

Research Question: How are rent prices expected to change in the next two years across New York City?

To answer this, we will conduct a time-series analysis of the median rent prices for NYC from 2010 to 2023. This will help discern patterns and trends that may shape future prices and explore how this variability plays out across the city.

library(ggplot2);library(ggthemes);library(gridExtra)  # For plots 
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
library(quantmod);library(xts);library(zoo) # For using xts class objects
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## Loading required package: TTR
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(forecast) # Set of forecasting functions
library(fpp); library(fpp2) # Datasets from Forecasting text by Rob Hyndman
## Loading required package: fma
## Loading required package: expsmooth
## Loading required package: lmtest
## Loading required package: tseries
## 
## Attaching package: 'fpp2'
## The following objects are masked from 'package:fpp':
## 
##     ausair, ausbeer, austa, austourists, debitcards, departures,
##     elecequip, euretail, guinearice, oil, sunspotarea, usmelec
library(tseries) # for a statistical test
library(dplyr) # Data wrangling

Exploratory Plots

time_retal = city_borough
time_retal$Date = as.Date(paste(time_retal$Year, time_retal$Month, "01", sep="-"))
time_retal$`median_rental_price` <- as.numeric(time_retal$`median_rental_price`)

Plot for NYC

# NYC median rent over time
time_retal %>%
  filter(areaType == "city")%>%
  ggplot(aes(x=Date, y=median_rental_price)) + 
  geom_line() +
  facet_grid(areaName~.) + 
  labs(title = "Median Rent Over Time NYC", 
       x = "Date", 
       y = "Price") 

Plot for 5 boroughs

# Median rent over time for the 5 boroughs
time_retal %>%
  filter(areaType == "borough")%>%
  ggplot(aes(x=Date, y=median_rental_price)) + 
  geom_line() +
  facet_grid(areaName~.) + 
  labs(title = "Median Rent Over Time by Borough", 
       x = "Date", 
       y = "Price") 

Initial Observations: A descriptive look at the median rental prices reveals a structural break in the data post-2020, attributable to the macroeconomic impact of Covid-19, which caused a drop in rental prices. However, this drop was more than offset by an upward spike in the past two years. The seasonal plot shows how rental prices fluctuate within the year, with minor peaks in the summer months and troughs in the winter months.

TS NYC Preprocessing

NYCtime_retal = city_borough %>%
  filter(areaType == "city") 

NYCtime_retal$Date <- as.Date(paste(NYCtime_retal$Year, NYCtime_retal$Month, "01", sep="-"))

# Convert the Median Rental Price to numeric
NYCtime_retal$`median_rental_price` <- as.numeric(NYCtime_retal$`median_rental_price`)

# Convert to a TS object
NYC_ts <- ts(NYCtime_retal$`median_rental_price`, start=c(2010, 1), frequency=12)
NYC_ts 
##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 2010 2650 2600 2600 2650 2695 2700 2750 2750 2800 2825 2800 2750
## 2011 2750 2725 2748 2795 2800 2895 2896 2950 3000 3000 3000 2995
## 2012 2950 2925 2950 2995 2995 2995 2995 3000 3050 3000 3000 3000
## 2013 2995 3000 3000 2995 2995 2995 2995 3000 3000 2995 2950 2930
## 2014 2900 2900 2900 2950 2950 2903 2887 2887 2850 2825 2800 2800
## 2015 2800 2800 2800 2850 2850 2899 2900 2950 2900 2895 2825 2850
## 2016 2860 2895 2900 2925 2950 2975 2953 2950 2900 2867 2800 2800
## 2017 2750 2750 2700 2750 2770 2795 2800 2856 2800 2795 2700 2700
## 2018 2695 2672 2699 2750 2795 2800 2825 2800 2784 2700 2658 2654
## 2019 2695 2700 2700 2800 2850 2900 2950 2949 2926 2900 2899 2895
## 2020 2900 2900 2925 2995 2950 2890 2800 2745 2650 2550 2500 2500
## 2021 2499 2500 2495 2500 2520 2600 2695 2700 2750 2700 2750 2800
## 2022 2900 2995 3000 3200 3350 3500 3550 3550 3500 3500 3400 3405
## 2023 3495 3500 3500 3675 3728 3750 3795 3750 3700 3600 3500 3500
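The post-2020 break noted earlier can be quantified directly from this series, for example by comparing annual means around the pandemic. This is a minimal sketch using the NYC_ts object built above:

```r
# Mean citywide median rent for a given calendar year
annual_mean <- function(ts_obj, yr) {
  mean(window(ts_obj, start = c(yr, 1), end = c(yr, 12)))
}

pre_covid <- annual_mean(NYC_ts, 2019)  # pre-pandemic level
trough    <- annual_mean(NYC_ts, 2021)  # pandemic trough
rebound   <- annual_mean(NYC_ts, 2023)  # post-pandemic rebound

c(pre_covid = pre_covid, trough = trough, rebound = rebound)
```

From the printed series this gives roughly $2,847 (2019), $2,626 (2021), and $3,624 (2023), consistent with the drop and rebound visible in the plots.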

Visualize NYC_ts

ggseasonplot(NYC_ts) +
  labs(title = "NYC Median Rent by Year", 
       x = "Date", 
       y = "Price")

ggseasonplot(NYC_ts, polar = T) + 
  labs(title = "NYC Median Rent by Year", 
       x = "Date", 
       y = "Price")

NYC_ts%>%
  stl(s.window = 'periodic')%>%
  autoplot()

library(forecast)
pacf(x = NYC_ts)

Train and Test

train_NYC = window(NYC_ts,start=c(2010,01),end=c(2022,12)) # 13 years (2010-2022) for training
test_NYC = window(NYC_ts,start=c(2023,01),end=c(2023,12)) # 1 year (2023) for testing

H1.1 Simple Forecasting Methods

Seasonal Naive Method
seasonal_naive_model = snaive(train_NYC,h=12)
seasonal_naive_model$mean
##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 2023 2900 2995 3000 3200 3350 3500 3550 3550 3500 3500 3400 3405
accuracy(seasonal_naive_model,x = NYC_ts)
##                     ME     RMSE      MAE      MPE     MAPE     MASE      ACF1
## Training set  50.55556 255.8956 168.0278 1.293095 5.700921 1.000000 0.9598879
## Test set     303.58333 348.7010 303.5833 8.417811 8.417811 1.806745 0.7521645
##              Theil's U
## Training set        NA
## Test set      4.295927
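MAPE, the main metric we compare across models, is the mean absolute percentage error. Computed by hand it is a one-liner (a minimal sketch; the test-set value reproduces the 8.42 reported by accuracy() above):

```r
# Mean absolute percentage error, as reported by accuracy()
mape <- function(actual, forecast) {
  mean(abs((actual - forecast) / actual)) * 100
}

# Seasonal naive test-set MAPE, matching accuracy()'s 8.42
mape(as.numeric(test_NYC), as.numeric(seasonal_naive_model$mean))
```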
Drift Method
drift_model = rwf(train_NYC,h=12,drift = T)
drift_model$mean
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 2023 3409.871 3414.742 3419.613 3424.484 3429.355 3434.226 3439.097 3443.968
##           Sep      Oct      Nov      Dec
## 2023 3448.839 3453.710 3458.581 3463.452
accuracy(drift_model,x = NYC_ts)
##                         ME      RMSE       MAE         MPE     MAPE      MASE
## Training set -1.195309e-13  45.99884  32.08075 -0.02137561 1.115442 0.1909253
## Test set      1.877554e+02 219.94235 187.75538  5.08626853 5.086269 1.1174068
##                   ACF1 Theil's U
## Training set 0.4292964        NA
## Test set     0.7012110  2.980163

H1.2 Exponential Smoothing Models

Simple Exponential Smoothing
ses_model = ses(train_NYC,h = 12)
ses_model$mean
##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 2023 3405 3405 3405 3405 3405 3405 3405 3405 3405 3405 3405 3405
accuracy(ses_model,x = NYC_ts) 
##                    ME      RMSE       MAE       MPE     MAPE      MASE
## Training set   4.5182  46.28463  31.39466 0.1360588 1.091805 0.1868421
## Test set     219.4172 247.76192 219.41717 5.9591531 5.959153 1.3058386
##                   ACF1 Theil's U
## Training set 0.4346218        NA
## Test set     0.7082495  3.360923
Holt Method
holt_model = holt(train_NYC,h=12)
holt_model$mean
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 2023 3370.676 3342.576 3314.476 3286.376 3258.275 3230.175 3202.075 3173.975
##           Sep      Oct      Nov      Dec
## 2023 3145.874 3117.774 3089.674 3061.574
accuracy(holt_model,x=NYC_ts)
##                        ME      RMSE       MAE         MPE      MAPE      MASE
## Training set   0.02514677  43.10313  32.73571  0.01975212  1.140001 0.1948232
## Test set     408.29168122 437.80641 408.29168 11.16654613 11.166546 2.4299059
##                     ACF1 Theil's U
## Training set -0.02300429        NA
## Test set      0.74804521  5.981476
Holt’s Method with Damping
holt_damped_model = holt(train_NYC, h=12, damped = T)
holt_damped_model$mean
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 2023 3377.408 3362.281 3350.160 3340.449 3332.668 3326.434 3321.439 3317.437
##           Sep      Oct      Nov      Dec
## 2023 3314.230 3311.661 3309.603 3307.954
accuracy(holt_damped_model,x=NYC_ts)
##                       ME      RMSE       MAE        MPE     MAPE      MASE
## Training set   0.9865076  40.41997  30.97691 0.04057374 1.077790 0.1843559
## Test set     293.4397406 318.76429 293.43974 7.99663627 7.996636 1.7463764
##                     ACF1 Theil's U
## Training set -0.02522248        NA
## Test set      0.71559956   4.33274
Holt-Winters Additive
hw_additive = hw(train_NYC,h=12,seasonal = 'additive', damped=T)
hw_additive$mean
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 2023 3382.099 3385.140 3391.893 3447.936 3464.794 3493.813 3490.153 3496.547
##           Sep      Oct      Nov      Dec
## 2023 3478.582 3446.330 3409.390 3412.947
accuracy(hw_additive,x = NYC_ts)
##                     ME     RMSE       MAE        MPE     MAPE      MASE
## Training set   2.49225  38.5647  29.13235 0.08017633 1.013622 0.1733782
## Test set     182.78142 197.8411 182.78142 4.98266301 4.982663 1.0878048
##                   ACF1 Theil's U
## Training set 0.1784840        NA
## Test set     0.6785816  2.672182
Holt-Winters Multiplicative
hw_multiplicative = hw(train_NYC,h=12,seasonal = 'multiplicative', damped=T)
hw_multiplicative$mean
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 2023 3410.151 3415.702 3434.519 3526.436 3576.448 3634.081 3665.192 3684.871
##           Sep      Oct      Nov      Dec
## 2023 3672.138 3636.529 3582.870 3570.109
accuracy(hw_multiplicative,x=NYC_ts)
##                    ME     RMSE      MAE       MPE     MAPE      MASE      ACF1
## Training set  1.57594 37.21519 28.97611 0.0577228 1.008207 0.1724483 0.2358949
## Test set     56.99625 96.63299 88.58094 1.5359082 2.433496 0.5271803 0.7421122
##              Theil's U
## Training set        NA
## Test set      1.301538

H1.3 ETS Models

ETS AAA (not white noise)
ets_aaa = ets(train_NYC,model = 'AAA')
summary(ets_aaa)
## ETS(A,Ad,A) 
## 
## Call:
##  ets(y = train_NYC, model = "AAA") 
## 
##   Smoothing parameters:
##     alpha = 0.9619 
##     beta  = 0.1389 
##     gamma = 0.0279 
##     phi   = 0.8238 
## 
##   Initial states:
##     l = 2739.3009 
##     b = 16.7025 
##     s = -28.6161 -33.2154 2.0309 34.9963 57.3081 42.5817
##            53.8002 22.091 6.3266 -44.6203 -48.5522 -64.1309
## 
##   sigma:  40.855
## 
##      AIC     AICc      BIC 
## 1963.307 1968.300 2018.204 
## 
## Training set error measures:
##                   ME    RMSE      MAE        MPE     MAPE      MASE     ACF1
## Training set 2.49225 38.5647 29.13235 0.08017633 1.013622 0.1733782 0.178484
checkresiduals(ets_aaa)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,Ad,A)
## Q* = 59.061, df = 24, p-value = 8.657e-05
## 
## Model df: 0.   Total lags used: 24

The residuals from the ETS(A,Ad,A) model are not white noise: the Ljung-Box test shows significant autocorrelation over the first 24 lags (Q* = 59.06, p < 0.001). Because the residuals retain structure the model has not captured, it is advisable to revisit the model specification.
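The Ljung-Box statistic that checkresiduals() reports can also be computed directly with stats::Box.test(). A minimal sketch, using simulated white-noise values as a stand-in for residuals(ets_aaa):

```r
# Stand-in for residuals(ets_aaa): pure white noise of the same length
set.seed(617)
res <- rnorm(156)

# Ljung-Box test over 24 lags; fitdf subtracts fitted parameters
lb <- Box.test(res, lag = 24, type = "Ljung-Box", fitdf = 0)
lb$statistic   # Q* statistic
lb$p.value     # large p-value is consistent with white noise
```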

ETS Automatic Selection
ets_auto = ets(train_NYC)
summary(ets_auto)
## ETS(M,Ad,N) 
## 
## Call:
##  ets(y = train_NYC) 
## 
##   Smoothing parameters:
##     alpha = 0.8185 
##     beta  = 0.6798 
##     phi   = 0.8 
## 
##   Initial states:
##     l = 2581.9604 
##     b = 23.9438 
## 
##   sigma:  0.0144
## 
##      AIC     AICc      BIC 
## 1954.362 1954.926 1972.662 
## 
## Training set error measures:
##                    ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 1.123409 40.80343 31.29631 0.04537647 1.089583 0.1862568
##                    ACF1
## Training set -0.0151643
ets_auto_forecast = forecast(ets_auto,h=12)
accuracy(ets_auto_forecast,x = NYC_ts)
##                      ME      RMSE       MAE        MPE     MAPE      MASE
## Training set   1.123409  40.80343  31.29631 0.04537647 1.089583 0.1862568
## Test set     296.042525 321.34428 296.04253 8.06821713 8.068217 1.7618666
##                    ACF1 Theil's U
## Training set -0.0151643        NA
## Test set      0.7157938  4.368213

H1.4 ARIMA

Auto-arima models
model_auto = auto.arima(y = train_NYC,d = 1,D = 1,stepwise = F,approximation = F)
model_auto
## Series: train_NYC 
## ARIMA(2,1,0)(0,1,1)[12] 
## 
## Coefficients:
##          ar1     ar2     sma1
##       0.2190  0.4040  -0.7446
## s.e.  0.0767  0.0779   0.0760
## 
## sigma^2 = 1392:  log likelihood = -723.99
## AIC=1455.97   AICc=1456.26   BIC=1467.82
checkresiduals(model_auto)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,0)(0,1,1)[12]
## Q* = 21.743, df = 21, p-value = 0.4145
## 
## Model df: 3.   Total lags used: 24
#Forecast
arima_auto_forecast <- forecast(model_auto, h=12)
accuracy(arima_auto_forecast, x = NYC_ts)
##                       ME      RMSE       MAE        MPE      MAPE      MASE
## Training set   0.6321782  35.34529  26.67719 0.02739003 0.9221564 0.1587665
## Test set     128.9062272 144.36246 128.90623 3.51018986 3.5101899 0.7671721
##                    ACF1 Theil's U
## Training set 0.02671677        NA
## Test set     0.66884334  1.946462

Model Summary

accuracy_table = rbind(seasonal_naive_model = accuracy(object = seasonal_naive_model,x = NYC_ts)[2,],
      drift_model = accuracy(object = drift_model,x = NYC_ts)[2,],
      ses_model = accuracy(object = ses_model,x = NYC_ts)[2,],
      holt_model = accuracy(object = holt_model,x = NYC_ts)[2,],
      holt_damped_model = accuracy(object = holt_damped_model,x = NYC_ts)[2,],
      hw_additive_model = accuracy(object = hw_additive,x = NYC_ts)[2,],
      hw_multiplicative = accuracy(object = hw_multiplicative,x = NYC_ts)[2,],
      ets_auto = accuracy(ets_auto_forecast,x = NYC_ts)[2,],
      arima = accuracy(arima_auto_forecast,x=NYC_ts)[2,]
      )
accuracy_table
##                             ME      RMSE       MAE       MPE      MAPE
## seasonal_naive_model 303.58333 348.70104 303.58333  8.417811  8.417811
## drift_model          187.75538 219.94235 187.75538  5.086269  5.086269
## ses_model            219.41717 247.76192 219.41717  5.959153  5.959153
## holt_model           408.29168 437.80641 408.29168 11.166546 11.166546
## holt_damped_model    293.43974 318.76429 293.43974  7.996636  7.996636
## hw_additive_model    182.78142 197.84110 182.78142  4.982663  4.982663
## hw_multiplicative     56.99625  96.63299  88.58094  1.535908  2.433496
## ets_auto             296.04253 321.34428 296.04253  8.068217  8.068217
## arima                128.90623 144.36246 128.90623  3.510190  3.510190
##                           MASE      ACF1 Theil's U
## seasonal_naive_model 1.8067449 0.7521645  4.295927
## drift_model          1.1174068 0.7012110  2.980163
## ses_model            1.3058386 0.7082495  3.360923
## holt_model           2.4299059 0.7480452  5.981476
## holt_damped_model    1.7463764 0.7155996  4.332740
## hw_additive_model    1.0878048 0.6785816  2.672182
## hw_multiplicative    0.5271803 0.7421122  1.301538
## ets_auto             1.7618666 0.7157938  4.368213
## arima                0.7671721 0.6688433  1.946462
autoplot(train_NYC, color='black')+
  autolayer(test_NYC,size=1.2,color='black')+
  autolayer(seasonal_naive_model,series = 'Seasonal Naive Model',PI=F)+
  autolayer(drift_model,series = 'Drift Model',PI=F)+
  autolayer(ses_model,series = 'SES Model',PI=F)+
  autolayer(holt_model,series = 'Holt',PI=F)+
  autolayer(holt_damped_model,series = 'Holt Damped',PI=F)+
  autolayer(hw_additive,series = 'Holt Winter Additive',PI=F)+
  autolayer(hw_multiplicative,series = 'Holt Winter Multiplicative',PI=F)+
  autolayer(ets_auto_forecast,series = 'ETS Auto',PI=F)+
  autolayer(arima_auto_forecast,series = 'Arima Auto',PI=F)+
  labs(title = "Prediction Performance: Median Rent Over Time NYC", 
       x = "Date", 
       y = "Price") 

Model Selection: To obtain the most accurate predictions, we compare the performance of the simple forecasting, exponential smoothing, ETS and ARIMA models on the testing data. As observed in Figure 4, the Holt-Winters multiplicative method tracks the testing data most closely. This model is well suited here because the data exhibit both trend and seasonal variation, and the seasonal swings grow with the level of the series. It yields the lowest test-set RMSE, at 96.63. Additionally, the diagnostic checks show the residuals behaving as white noise, with no apparent autocorrelation, affirming the model’s fit. The model’s forecast aligns well with the actual data when visualized, lending credibility to future rental price predictions.
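The accuracy_table above already makes the decision visible; as a quick check, the winner can also be pulled programmatically from the test-set RMSE column. A sketch, with the RMSE values transcribed from the table:

```r
# Test-set RMSE per model, copied from accuracy_table above
rmse <- c(seasonal_naive = 348.70, drift = 219.94, ses = 247.76,
          holt = 437.81, holt_damped = 318.76, hw_additive = 197.84,
          hw_multiplicative = 96.63, ets_auto = 321.34, arima = 144.36)

best <- names(which.min(rmse))
best   # "hw_multiplicative"
```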

Predict 2024-2025

NYC_forecasted_2425 <- hw(train_NYC,h=36,seasonal = 'multiplicative', damped=T)
NYC_forecasted_2425$mean
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 2023 3410.151 3415.702 3434.519 3526.436 3576.448 3634.081 3665.192 3684.871
## 2024 3564.451 3559.971 3569.977 3656.353 3699.560 3751.000 3775.432 3788.509
## 2025 3636.349 3627.194 3633.096 3716.889 3756.926 3805.479 3826.799 3836.800
##           Sep      Oct      Nov      Dec
## 2023 3672.138 3636.529 3582.870 3570.109
## 2024 3768.737 3726.022 3665.371 3647.043
## 2025 3813.748 3767.721 3703.813 3682.891
autoplot(NYC_forecasted_2425)

Forecasting: An upward trend in the median rent prices is evident. The forecast starts at USD 3,410 in January 2023 and rises to USD 3,683 by December 2025. The method incorporates seasonality and a damping factor that gradually flattens the trend component over time, producing point forecasts with prediction intervals, shown as the shaded region. The multiplicative form means the seasonal swings are proportional to the level of the series. This steady increase could be attributed to growth in the rental market driven by inflation, housing demand and other macroeconomic factors.
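Because training ends in December 2022, the 36-step forecast begins in January 2023; the 2024-2025 portion can be isolated with window(). A sketch, using a dummy monthly series as a stand-in for NYC_forecasted_2425$mean:

```r
# Dummy 36-step monthly forecast path starting Jan 2023
fc_mean <- ts(seq(3410, by = 8, length.out = 36),
              start = c(2023, 1), frequency = 12)

# Keep only the 2024-2025 portion
fc_2425 <- window(fc_mean, start = c(2024, 1))
length(fc_2425)   # 24 monthly values covering 2024-2025
```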

H2 Noise Data Clustering

Research Question: How do the noise complaint types vary across neighborhoods in New York City?

Noise Complaint Analysis: How do noise complaints vary across neighborhoods in the city? To analyze this, we implement cluster analysis to group neighborhoods on the basis of the various types of noise complaints. This helps characterize a given neighborhood and determine whether it suits a resident’s requirements.

Profile Clustering Pre-processing

cleaned_noise_final =read.csv("/Users/tianlanmac16/Desktop/Columbia AA/5205-R/StatSquad/NOISE/cleaned_noise_final.csv")
library(dplyr)
library(skimr)
library(tidyr)
library(lubridate)
H2_data<- cleaned_noise_final%>%
  mutate(year = year(Created.Date))
H2_data = H2_data %>%
  filter(year %in% c(2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022))

Summary by Total Complaints

H2_sum = H2_data %>%
  group_by(Community.Board, Borough) %>%
  summarise(complaints = n()) %>%
  arrange(desc(complaints))
## `summarise()` has grouped output by 'Community.Board'. You can override using
## the `.groups` argument.
sum(H2_sum$complaints)
## [1] 5279309

In 2012-2022, the top 5 community boards that received the most complaints:

H2_sum$Community.Board[1:5]
## [1] "12 MANHATTAN" "12 BRONX"     "01 BROOKLYN"  "03 MANHATTAN" "10 MANHATTAN"

In 2012-2022, the 5 community boards that received the fewest complaints:

H2_sum$Community.Board[67:71]
## [1] "28 BRONX"    "56 BROOKLYN" "84 QUEENS"   "80 QUEENS"   "83 QUEENS"
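The row indices above (1:5 and 67:71) assume exactly 71 community boards; the same rankings can be pulled without hard-coding positions using head() and tail(). A sketch on a toy summary table standing in for H2_sum, which is already sorted in descending order:

```r
# Toy stand-in for H2_sum: 10 boards, complaints sorted descending
h2 <- data.frame(Community.Board = paste0("CB", 1:10),
                 complaints = seq(1000, 100, by = -100))

top5    <- head(h2$Community.Board, 5)   # most complaints
bottom5 <- tail(h2$Community.Board, 5)   # fewest complaints
```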

Explanatory Plot

For each borough, display total complaints by community board using facet_grid:

library(ggplot2)
library(dplyr)

# Assuming H2_sum is already loaded
# Identify the highest and lowest points in each Borough
aaa <- H2_sum %>%
  group_by(Borough) %>%
  mutate(
    PointType = case_when(
      complaints == max(complaints) ~ "Highest",
      complaints == min(complaints) ~ "Lowest",
      TRUE ~ "Other"
    )
  ) %>%
  ungroup()

# Plot with points highlighted
ggplot(aaa, aes(x=Community.Board, y=complaints)) +
  geom_point(aes(color = PointType, shape = PointType)) + # Color and shape based on the PointType
  scale_shape_manual(values=c(20,5,20)) + # Assigning specific shapes to points
  scale_color_manual(values=c("red", "blue", "black")) + # Assigning colors to the highest and lowest points
  scale_y_continuous(labels = scales::comma) + # To change y-axis labels to plain numbers
  facet_grid(.~Borough) +
  labs(x = 'Community Board', y = 'Total Complaints', title = 'Total Complaints by Community Board and Borough') +
  geom_text(aes(label = Community.Board), position = position_dodge(width = 0.9), hjust = -0.1, size = 1.5) +
  theme(legend.position = "none") # Hide the legend

ggplot(aaa, aes(x=Community.Board, y=complaints)) +
  geom_point(aes(color = PointType, shape = PointType), size = 3) + # Color and shape based on the PointType
  scale_shape_manual(values=c(20,5,20)) + # Assigning specific shapes to points
  scale_color_manual(values=c("red", "blue", "black")) + # Assigning colors to the highest and lowest points
  scale_y_continuous(labels = scales::comma) + # To change y-axis labels to plain numbers
  facet_grid(.~Borough, scales = "free_x") + # Use free scales for the x-axis
  labs(x = 'Community Board', y = 'Total Complaints', title = 'Total Complaints by Community Board and Borough') +
  geom_text(data = subset(aaa, PointType != "Other"), aes(label = Community.Board), hjust = 1.1, vjust = 0, size = 1) +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    legend.position = "none",
    strip.text.x = element_text(size = 3) # Ensure facet labels are small enough to fit
  ) # Hide the legend and x-axis text/ticks

Initial observations: Visualizing the data shows the number of complaints from each Community Board. 12 MANHATTAN (covering the neighborhoods of Inwood and Washington Heights) and 12 BRONX (covering Edenwald, Wakefield, Williamsbridge, Woodlawn, Fish Bay, Eastchester, Olinville, and Baychester) have notably high numbers of complaints, while Staten Island has the fewest.

H2_c = H2_data %>%
  group_by(Complaint.Type, Community.Board, Borough) %>%
  summarise(complaints = n()) 
## `summarise()` has grouped output by 'Complaint.Type', 'Community.Board'. You
## can override using the `.groups` argument.
sum(H2_c$complaints)
## [1] 5279309

Pivot Wider for future clustering

H2_c <- H2_c %>%
  spread(key = Complaint.Type, value = complaints, fill = 0)
write.csv(H2_c, 'H2_datac.csv',row.names = F)
write.csv(H2_sum, 'H2_sum.csv',row.names = F)

Read for Clustering

data = read.csv(file = 'H2_datac.csv',stringsAsFactors = F)
data_cluster = data[,3:12]

Scale

data_cluster = scale(data_cluster)
head(data_cluster[,1:10])
##      Collection.Truck.Noise       Noise Noise...Commercial Noise...Helicopter
## [1,]             -0.3792224 -0.63139849        -0.47927000        -0.30990948
## [2,]              0.5314244  1.81479270         3.79085294        -0.22562332
## [3,]              0.1671657  1.79003636         0.03230396        -0.01826716
## [4,]              1.6242004  0.77598766         2.70223534        -0.20943794
## [5,]              0.3796499  0.05348697        -0.27000800        -0.24364101
## [6,]             -0.6827713 -0.80757713        -0.64086899        -0.31235255
##      Noise...House.of.Worship Noise...Park Noise...Residential
## [1,]              -0.26144715   -0.3905683           0.1459226
## [2,]               0.65630352    0.9670819           0.7177436
## [3,]              -0.46608074   -0.3603983          -0.7857963
## [4,]              -0.03510999    0.4775050           0.3491431
## [5,]              -0.28935173   -0.5496465           0.2587831
## [6,]              -0.27074868   -0.7402661          -0.4653364
##      Noise...Street.Sidewalk Noise...Vehicle Noise.Survey
## [1,]              0.01921795      -0.1113019   -0.3979640
## [2,]              1.06722254       1.3230505    1.5244315
## [3,]             -0.12698780      -0.1928585   -0.2848819
## [4,]              0.09173061       0.2757294    0.9665599
## [5,]             -0.37055162      -0.1432280    0.3483778
## [6,]             -0.25616150      -0.3613413   -0.4055028

H2.1 H-cluster

d = dist(x = data_cluster,method = 'euclidean') 
clusters = hclust(d = d,method='ward.D2')
plot(clusters)

cor(cophenetic(clusters),d)
## [1] 0.593858

A cophenetic correlation (CPC) above 0.7 indicates a relatively strong fit, while 0.3 < CPC < 0.7 indicates a moderate fit; at 0.59, this solution preserves the original distances only moderately well.
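The CPC can also be compared across linkage methods to see whether another linkage preserves the distances better. A sketch, using simulated data of the same shape as data_cluster (71 observations, 10 scaled variables):

```r
# Simulated stand-in for data_cluster
set.seed(617)
X     <- matrix(rnorm(71 * 10), nrow = 71)
d_sim <- dist(X, method = "euclidean")

# Cophenetic correlation for three common linkage methods
cpc <- sapply(c("average", "complete", "ward.D2"), function(m)
  cor(cophenetic(hclust(d_sim, method = m)), d_sim))
cpc
```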

2/3/4 h-clusters

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(gridExtra)

grid.arrange(fviz_dend(x = clusters,k=2),
             fviz_dend(x = clusters,k=3),
             fviz_dend(x = clusters,k=4)
)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Selecting Clusters ==3

h_segments3 = cutree(tree = clusters,k=3)
table(h_segments3)
## h_segments3
##  1  2  3 
## 46 13 12
library(cluster)
clusplot(data_cluster,
         h_segments3,
         color=T,shade=T,labels=4,lines=0,main='Hierarchical Cluster Plot')

H2.2 Kmeans Clustering

Determining Clusters

within_ss = sapply(1:10,FUN = function(x){
  set.seed(617)
  kmeans(x = data_cluster,centers = x,iter.max = 1000,nstart = 25)$tot.withinss})
  
ggplot(data=data.frame(cluster = 1:10,within_ss),aes(x=cluster,y=within_ss))+
  geom_line(col='steelblue',linewidth=1.2)+
  geom_point()+
  scale_x_continuous(breaks=seq(1,10,1))

ratio_ss = sapply(1:10,FUN = function(x) {
  set.seed(617)
  km = kmeans(x = data_cluster,centers = x,iter.max = 1000,nstart = 25)
  km$betweenss/km$totss} )
ggplot(data=data.frame(cluster = 1:10,ratio_ss),aes(x=cluster,y=ratio_ss))+
  geom_line(col='steelblue',linewidth=1.2)+
  geom_point()+
  scale_x_continuous(breaks=seq(1,10,1))

library(cluster)
silhouette_width = sapply(2:10,
                         FUN = function(x) pam(x = data_cluster,k = x)$silinfo$avg.width)
ggplot(data=data.frame(cluster = 2:10,silhouette_width),aes(x=cluster,y=silhouette_width))+
  geom_line(col='steelblue',linewidth=1.2)+
  geom_point()+
  scale_x_continuous(breaks=seq(2,10,1))

Kmeans clustering ==3

set.seed(617)
km3 = kmeans(x = data_cluster,centers = 3,iter.max=10000,nstart=25)
k_segments3 = km3$cluster
table(k_segments3)
## k_segments3
##  1  2  3 
## 18 49  4

Visualize

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## The following object is masked from 'package:randomForest':
## 
##     outlier
temp3 = data.frame(cluster = factor(k_segments3),
           factor1 = fa(data_cluster,nfactors = 2,rotate = 'varimax')$scores[,1],
           factor2 = fa(data_cluster,nfactors = 2,rotate = 'varimax')$scores[,2])
ggplot(temp3,aes(x=factor1,y=factor2,col=cluster))+
  geom_point()

library(cluster)
clusplot(data_cluster,
         k_segments3,
         color=T,shade=T,labels=4,lines=0,main='k-means Cluster Plot')

H2.3 Model Based Clustering (Dropped)

Plot of BIC

library(mclust)
## Package 'mclust' version 6.0.1
## Type 'citation("mclust")' for citing this R package in publications.
## 
## Attaching package: 'mclust'
## The following object is masked from 'package:psych':
## 
##     sim
mclust_bic = sapply(1:10,FUN = function(x) -Mclust(data_cluster,G=x)$bic)
mclust_bic
##     EEE,1                                                                       
## 1746.8072 1098.7971  910.0312  866.0948 1537.5045 1672.9681 1303.1518 1337.1515 
##                     
## 1464.8442 1408.5478
ggplot(data=data.frame(cluster = 1:10,bic = mclust_bic),aes(x=cluster,y=bic))+
  geom_line(col='steelblue',size=1.2)+
  geom_point()+
  scale_x_continuous(breaks=seq(1,10,1))

M-clustering ==3

library(mclust)
m_clusters3 = Mclust(data_cluster, G=3)
summary(m_clusters3)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VEV (ellipsoidal, equal shape) model with 3 components: 
## 
##  log-likelihood  n  df       BIC       ICL
##       -73.50577 71 179 -910.0312 -910.0314
## 
## Clustering table:
##  1  2  3 
## 34 26 11
m_segments3 = m_clusters3$classification
table(m_segments3)
## m_segments3
##  1  2  3 
## 34 26 11

Visualize

library(cluster)
clusplot(data_cluster,
         m_segments3,
         color=T,shade=T,labels=4,lines=0,main='mclust Cluster Plot')

Contrast Result

table(h_segments3)
## h_segments3
##  1  2  3 
## 46 13 12
table(k_segments3)
## k_segments3
##  1  2  3 
## 18 49  4
table(m_segments3)
## m_segments3
##  1  2  3 
## 34 26 11
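The marginal counts above show that the three solutions carve up the boards differently, but a cross-tabulation of two assignment vectors is more informative, since matching clusters may simply carry different labels. A sketch on toy assignments standing in for h_segments3 and k_segments3:

```r
# Toy cluster assignments for the same six observations
a <- c(1, 1, 2, 2, 3, 3)   # stand-in for h_segments3
b <- c(2, 2, 1, 1, 3, 3)   # stand-in for k_segments3

cross <- table(a, b)
cross   # off-diagonal mass here reflects label permutation, not disagreement
```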

Model Selection: A K-means model is selected for this analysis, as it is scalable and easy to implement. A three-cluster solution is chosen, supported by the average silhouette width in Figure 7 (the criterion to maximize, since larger average widths indicate better-separated clusters). Hierarchical and model-based clustering are also tested with both 3 and 4 clusters; however, their results are harder to interpret when profiling the clusters. One community board stood out when varying the number of clusters between 3 and 4 in both hierarchical and K-means clustering: 08 Queens (including the neighborhoods of Jamaica Estates, Holliswood, and Flushing South). This area shows high complaint counts for residential, vehicle and truck noise while recording the fewest commercial complaints, suggesting a noisy area with a less vibrant lifestyle near JFK airport.
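The average silhouette width backing the choice of k = 3 can be computed directly for a fitted k-means solution. A sketch, using simulated data of the same shape as data_cluster:

```r
library(cluster)

# Simulated stand-in for data_cluster (71 boards x 10 scaled variables)
set.seed(617)
X  <- matrix(rnorm(71 * 10), nrow = 71)
km <- kmeans(X, centers = 3, iter.max = 1000, nstart = 25)

# Per-observation silhouettes, then the average width for this k
sil       <- silhouette(km$cluster, dist(X))
avg_width <- mean(sil[, "sil_width"])
avg_width
```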

Profile (based on K-means 3 clusters)

data2 = cbind(data, k_segments3)
library(dplyr)
data2 %>%
  select(Collection.Truck.Noise:Noise.Survey,k_segments3)%>%
  group_by(k_segments3)%>%
  summarize_all(function(x) round(mean(x,na.rm=T),0))%>%
  data.frame()
##   k_segments3 Collection.Truck.Noise Noise Noise...Commercial
## 1           1                     26 10829              13797
## 2           2                     16  4656               3534
## 3           3                    120 29820              11606
##   Noise...Helicopter Noise...House.of.Worship Noise...Park Noise...Residential
## 1                461                      327         1503               73297
## 2                431                      109          531               26096
## 3              10925                      131          556               31257
##   Noise...Street.Sidewalk Noise...Vehicle Noise.Survey
## 1                   32404           13256          267
## 2                    7001            3188           68
## 3                   12380            6142          168
data2 %>%
  select(Collection.Truck.Noise:Noise.Survey, k_segments3) %>%
  group_by(k_segments3) %>%
  summarize_all(function(x) round(mean(x, na.rm = T), 0)) %>%
  gather(key = var, value = value, Collection.Truck.Noise:Noise.Survey) %>%
  mutate(var = ifelse(var == "Noise", "Noise.Random", var)) %>%
  ggplot(aes(x = var, y = value, fill = factor(k_segments3))) +
  geom_col(position = 'dodge') +
  coord_flip() +
  geom_text(aes(label = value), position = position_dodge(width = 0.9), hjust = -0.1, size = 1.5) +
  theme(legend.position = "bottom", 
        legend.box = "horizontal", 
        legend.key.size = unit(0.5, "cm"), 
        legend.text = element_text(size = 6), 
        legend.title = element_text(size = 7)) 

With k = 4, the K-means segment sizes are 17, 49, 4 and 1; 08 Queens stood out.

write.csv(data2, 'data2.csv',row.names = F)

Segments Breaking Down

# Add a new column with the sum of columns 3 through 12
cluster_profile <- mutate(data2, total_complaints = rowSums(data2[, 3:12], na.rm = TRUE))

cluster_profile <- cluster_profile %>%
  select(1:2, total_complaints, everything())
table(cluster_profile$k_segments3,data2[,2])
##    
##     BRONX BROOKLYN MANHATTAN QUEENS STATEN ISLAND
##   1     5        5         6      2             0
##   2    10       15         3     17             4
##   3     0        0         4      0             0

Segment1

cluster_profile %>%
  filter(k_segments3 == 1) %>%
  group_by(Community.Board, Borough) %>%
  select(1:13) %>%
  arrange(total_complaints)
## # A tibble: 18 × 13
## # Groups:   Community.Board, Borough [18]
##    Community.Board Borough   total_complaints Collection.Truck.Noise Noise
##    <chr>           <chr>                <dbl>                  <int> <int>
##  1 07 BROOKLYN     BROOKLYN             52834                     22  7093
##  2 02 MANHATTAN    MANHATTAN            96231                     52 22785
##  3 11 MANHATTAN    MANHATTAN           100502                     16  6915
##  4 02 BROOKLYN     BROOKLYN            100663                     31 21005
##  5 08 QUEENS       QUEENS              101814                     38 17885
##  6 09 BRONX        BRONX               111318                     11  3342
##  7 01 QUEENS       QUEENS              117039                     78 14096
##  8 04 BROOKLYN     BROOKLYN            122780                      5  7271
##  9 09 MANHATTAN    MANHATTAN           132066                     31  7756
## 10 05 BRONX        BRONX               136208                      3  1762
## 11 03 BROOKLYN     BROOKLYN            136652                     11 10529
## 12 07 BRONX        BRONX               152444                     11  4216
## 13 04 BRONX        BRONX               163385                      3  2464
## 14 10 MANHATTAN    MANHATTAN           167450                     24  9224
## 15 03 MANHATTAN    MANHATTAN           171221                     40 21412
## 16 01 BROOKLYN     BROOKLYN            172334                     42 22740
## 17 12 BRONX        BRONX               279758                      6  2643
## 18 12 MANHATTAN    MANHATTAN           316333                     45 11788
## # ℹ 8 more variables: Noise...Commercial <int>, Noise...Helicopter <int>,
## #   Noise...House.of.Worship <int>, Noise...Park <int>,
## #   Noise...Residential <int>, Noise...Street.Sidewalk <int>,
## #   Noise...Vehicle <int>, Noise.Survey <int>

Segment2

cluster_profile %>%
  filter(k_segments3 == 2) %>%
  group_by(Community.Board, Borough) %>%
  select(1:13)
## # A tibble: 49 × 13
## # Groups:   Community.Board, Borough [49]
##    Community.Board  Borough       total_complaints Collection.Truck.Noise Noise
##    <chr>            <chr>                    <dbl>                  <int> <int>
##  1 01 BRONX         BRONX                    68947                     12  2385
##  2 01 MANHATTAN     MANHATTAN                56867                     30 22534
##  3 01 STATEN ISLAND STATEN ISLAND            73369                     37  8084
##  4 02 BRONX         BRONX                    37180                      2   919
##  5 02 QUEENS        QUEENS                   59091                     30 10010
##  6 02 STATEN ISLAND STATEN ISLAND            27109                     21  4981
##  7 03 BRONX         BRONX                    75296                      5  1581
##  8 03 QUEENS        QUEENS                   77251                     12  4751
##  9 03 STATEN ISLAND STATEN ISLAND            26025                     32  5631
## 10 04 QUEENS        QUEENS                   57441                      4  3552
## # ℹ 39 more rows
## # ℹ 8 more variables: Noise...Commercial <int>, Noise...Helicopter <int>,
## #   Noise...House.of.Worship <int>, Noise...Park <int>,
## #   Noise...Residential <int>, Noise...Street.Sidewalk <int>,
## #   Noise...Vehicle <int>, Noise.Survey <int>

Segment3

cluster_profile %>%
  filter(k_segments3 == 3) %>%
  group_by(Community.Board, Borough) %>%
  select(1:13)
## # A tibble: 4 × 13
## # Groups:   Community.Board, Borough [4]
##   Community.Board Borough   total_complaints Collection.Truck.Noise Noise
##   <chr>           <chr>                <dbl>                  <int> <int>
## 1 04 MANHATTAN    MANHATTAN           109874                     37 30683
## 2 06 MANHATTAN    MANHATTAN            79380                     66 26555
## 3 07 MANHATTAN    MANHATTAN           136185                    185 29754
## 4 08 MANHATTAN    MANHATTAN            86978                    190 32288
## # ℹ 8 more variables: Noise...Commercial <int>, Noise...Helicopter <int>,
## #   Noise...House.of.Worship <int>, Noise...Park <int>,
## #   Noise...Residential <int>, Noise...Street.Sidewalk <int>,
## #   Noise...Vehicle <int>, Noise.Survey <int>

Heat Map for Borough

prop.table(table(cluster_profile$k_segments3,data2[,2]),1)
##    
##          BRONX   BROOKLYN  MANHATTAN     QUEENS STATEN ISLAND
##   1 0.27777778 0.27777778 0.33333333 0.11111111    0.00000000
##   2 0.20408163 0.30612245 0.06122449 0.34693878    0.08163265
##   3 0.00000000 0.00000000 1.00000000 0.00000000    0.00000000
library(ggplot2)
tab = prop.table(table(cluster_profile$k_segments3,data2[,2]),1)
tab2 = data.frame(round(tab,2))
library(RColorBrewer)
ggplot(data=tab2,aes(x=Var2,y=Var1,fill=Freq))+
  geom_tile()+
  geom_text(aes(label=Freq),size=6)+
  xlab(label = '')+
  ylab(label = '')+
  scale_fill_gradientn(colors=brewer.pal(n=9,name = 'Reds'))

Cluster 1: High Levels of Noise Complaints. These neighborhoods have the highest levels of noise complaints, particularly for residential, street/sidewalk, and vehicle noises. This indicates high population density and commercial zones with significant street activity. 18 areas fall under this cluster, including the East and West Village, Park Slope, and the center of Flushing. Individuals who enjoy vibrant, lively environments and prioritize convenience over noise could choose to live here.

Cluster 2: Moderate to Low Levels of Noise Complaints. These neighborhoods may have a small mix of residential and commercial areas. All areas within Staten Island fall under this category, indicating that it is the quietest borough overall. The majority of the Bronx, the Financial District, and Jackson Heights also fall under this category. These neighborhoods are ideal for families and individuals who prioritize peace and tranquility while remaining within commuting distance of urban conveniences.

Cluster 3: High to Moderate Levels of Noise Complaints. These neighborhoods have relatively moderate levels of street and sidewalk noise, indicating that they are not very densely populated. However, commercial, helicopter, and miscellaneous noise complaints are significantly higher here than in the other two clusters, likely due to proximity to helipads, offices, etc. As observed in Figure 9, only Manhattan neighborhoods lie in this cluster, including Chelsea, the Upper West Side, and the Upper East Side. These areas can be suitable for residents and families who want a short commute to restaurants, commercial spaces, and attractions.
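The verbal cluster labels above come down to comparing per-cluster average complaint counts. A minimal sketch of that comparison, using invented counts and column names that only echo the cluster_profile columns shown earlier (not the real data):

```r
library(dplyr)

# Hypothetical mini cluster profile: four community boards, a cluster label,
# and two complaint-type counts (names mirror cluster_profile, values invented)
profile <- tibble::tibble(
  k_segments3         = c(1, 1, 2, 3),
  Noise...Residential = c(9000, 8500, 2000, 5000),
  Noise...Commercial  = c(1200, 1100, 300, 4000)
)

# Per-cluster means are what the labels summarize: high residential noise
# in cluster 1, low overall in cluster 2, high commercial in cluster 3
cluster_means <- profile %>%
  group_by(k_segments3) %>%
  summarise(across(everything(), mean))
cluster_means
```

The same `group_by() %>% summarise(across(...))` pattern applied to the full cluster_profile would produce the per-segment averages behind each description.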

H3 Spatial Analysis on Crime

Research Question: How do the crimes vary across neighborhoods in New York City?

Crime Level Analysis: How do crime quantity and type differ between neighborhoods? To understand how crime quantity differs between neighborhoods, we will use spatial analysis to examine the distribution of crime across New York City. Second, we will use spatial analysis to identify the most frequent complaint category in each precinct.
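Before any mapping, the per-neighborhood comparison reduces to counting complaints by precinct and category. A minimal sketch on toy records, assuming column names like those in crime_cleaned.csv (the values are invented):

```r
library(dplyr)

# Toy complaint records; ADDR_PCT_CD and LAW_CAT_CD mirror the NYPD columns
crimes <- tibble::tibble(
  ADDR_PCT_CD = c(52, 52, 114, 19, 19, 19),
  LAW_CAT_CD  = c("MISDEMEANOR", "FELONY", "FELONY",
                  "FELONY", "VIOLATION", "FELONY")
)

# Complaint counts per precinct and category -- the aggregation that
# underlies the per-precinct comparisons below
crime_counts <- crimes %>%
  count(ADDR_PCT_CD, LAW_CAT_CD, name = "complaints") %>%
  arrange(desc(complaints))
crime_counts
```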

H3.1 Crime Complaints NYC

Read Map

library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
##   Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service/>
##   OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles/>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
library(ggplot2)
library(RColorBrewer)

crimedata = read.csv("crime_cleaned.csv")
register_google(key = "YOUR_API_KEY")  # supply your own Google Maps key; avoid committing a real key
newyork_map = get_map(location = c(-73.93835,40.66392),zoom = 10,scale = 4)
## ℹ <https://maps.googleapis.com/maps/api/staticmap?center=40.66392,-73.93835&zoom=10&size=640x640&scale=4&maptype=terrain&language=en-EN&key=xxx-I_GEtXSjUmNNNHQ>

Crime Type over NYC

ggmap(newyork_map) +
  geom_point(data = crimedata, aes(x = Longitude, y = Latitude, color = LAW_CAT_CD), size = 0.5) +
  scale_color_manual(name = "Crime Category", values = c("FELONY" = "red", "MISDEMEANOR" = "orange", "VIOLATION" = "yellow"), labels = c("Felony", "Misdemeanor", "Violation")) +
  theme(legend.text = element_text(size = 12),  # Adjust the size of legend text
        legend.title = element_text(size = 14), # Adjust the size of legend title
        legend.key.size = unit(2, "lines"),    # Adjust the size of legend symbols
        legend.key.height = unit(2, "lines"),  # Increase the height of legend symbols
        legend.key.width = unit(2, "lines")) +  # Increase the width of legend symbols
  guides(color = guide_legend(override.aes = list(size = 5))) +  # Increase the size of color points in the legend
  labs(title = "Crime Complaints Distribution in New York City 2012 - 2022") +
  theme(plot.title = element_text(hjust = 0.5, size = 20))  # Center and enlarge the title

This map shows the distribution of crime by type (felony, misdemeanor, violation) across the city’s geography. Crime is more concentrated in Manhattan, the southern Bronx, and central Brooklyn, and felonies were the most common category of crime. Displaying the frequency of crimes as color-coded points allows for immediate visual identification of areas with higher crime densities. For spatial analysis, this map can help identify hotspots for specific types of crime and reveal geographic patterns or clusters. For example, a high concentration of red points in a particular area would indicate a region with a high incidence of felonies.
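Hotspots can also be made explicit rather than inferred from point overplotting, by adding a kernel-density layer. A sketch on simulated coordinates (the same stat_density_2d layer could be stacked on top of ggmap(newyork_map) with the real data):

```r
library(ggplot2)

set.seed(1)
# Simulated complaint coordinates around two made-up hotspots
pts <- data.frame(
  Longitude = c(rnorm(200, -73.99, 0.01), rnorm(200, -73.94, 0.01)),
  Latitude  = c(rnorm(200, 40.73, 0.01), rnorm(200, 40.82, 0.01))
)

# Filled 2D density contours highlight clusters of incidents
p <- ggplot(pts, aes(Longitude, Latitude)) +
  stat_density_2d(aes(fill = after_stat(level)),
                  geom = "polygon", alpha = 0.4) +
  geom_point(size = 0.3) +
  scale_fill_gradient(low = "yellow", high = "red")
p
```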

H3.2 Most Complaints per Precinct

Classify Impact

library(ggmap)
library(ggplot2)
library(dplyr)

data = read.csv("crime_summary.csv")

register_google(key = "YOUR_API_KEY")  # supply your own Google Maps key; avoid committing a real key

newyork_map = get_map(location = c(-73.93835,40.66392),zoom = 10,scale = 4)
## ℹ <https://maps.googleapis.com/maps/api/staticmap?center=40.66392,-73.93835&zoom=10&size=640x640&scale=4&maptype=terrain&language=en-EN&key=xxx-I_GEtXSjUmNNNHQ>
classify_impact <- function(ofns_desc, law_cat_cd) {
  if (law_cat_cd == "FELONY") {
    high_impact <- c("HOMICIDE-NEGLIGENT,UNCLASIFIE", "MURDER & NON-NEGL. MANSLAUGHTER",
                     "HOMICIDE-NEGLIGENT,UNCLASSIFIE", "RAPE", "ROBBERY", "ARSON", "KIDNAPPING & RELATED OFFENSES", "DANGEROUS WEAPONS",
                     "SEX CRIMES", "THEFT-FRAUD")
    medium_impact <- c("FELONY ASSAULT", "GRAND LARCENY", "GRAND LARCENY OF MOTOR VEHICLE", 
                       "FORGERY", "POSSESSION OF STOLEN PROPERTY", "DANGEROUS DRUGS", 
                       "MISCELLANEOUS PENAL LAW")
    low_impact <- c("BURGLARY", "CRIMINAL MISCHIEF & RELATED OF", "NYS LAWS-UNCLASSIFIED FELONY",
                    "OBSCENIT", "PROSTITUTION & RELATED OFFENSES", "OTHER STATE LAWS (NON PENAL LA")
  } else if (law_cat_cd == "MISDEMEANOR") {
    high_impact <- c("ASSAULT 3 & RELATED OFFENSES", "CRIMINAL MISCHIEF & RELATED OF",
                     "DANGEROUS WEAPONS", "SEX CRIMES", "OFFENSES RELATED TO CHILDREN")
    medium_impact <- c("OFFENSES AGAINST PUBLIC ADMINI", "OFFENSES AGAINST PUBLIC SAFETY",
                       "OFFENSES AGAINST THE PERSON", "OFFENSES INVOLVING FRAUD",
                       "OTHER OFFENSES RELATED TO THEF", "OFF. AGNST PUB ORD SENSBLTY &", "DANGEROUS DRUGS")
    low_impact <- c("ADMINISTRATIVE CODE", "AGRICULTURE & MRKTS LAW-UNCLASSIFIED",
                    "ANTICIPATORY OFFENSES", "CRIMINAL TRESPASS", "FRAUDS", "FRAUDULENT ACCOSTING",
                    "INTOXICATED & IMPAIRED DRIVING", "JOSTLING", "OTHER STATE LAWS (NON PENAL LA",
                    "PETIT LARCENY", "PETIT LARCENY OF MOTOR VEHICLE", "POSSESSION OF STOLEN PROPERTY",
                    "THEFT OF SERVICES", "UNAUTHORIZED USE OF A VEHICLE", "VEHICLE AND TRAFFIC LAWS")
  } else if (law_cat_cd == "VIOLATION") {
    high_impact <- c("HARASSMENT 2", "HARRASSMENT 2")
    medium_impact <- c("ADMINISTRATIVE CODE", "OTHER STATE LAWS")
    low_impact <- c("MISCELLANEOUS PENAL LAW")
  } else {
    return(NA)  # Return NA for cases with unknown LAW_CAT_CD
  }
  
  if (ofns_desc %in% high_impact) {
    return("High Impact")
  } else if (ofns_desc %in% medium_impact) {
    return("Medium Impact")
  } else if (ofns_desc %in% low_impact) {
    return("Low Impact")
  } else {
    return(NA)  # Return NA for cases not classified
  }
}

data$impact <- mapply(classify_impact, data$OFNS_DESC, data$LAW_CAT_CD)

top_crime_precinct <- data %>%
  group_by(ADDR_PCT_CD, LAW_CAT_CD, impact) %>%
  summarize(total_complaints = sum(complaints)) %>%
  group_by(ADDR_PCT_CD) %>%
  top_n(1, total_complaints) %>%
  ungroup()
## `summarise()` has grouped output by 'ADDR_PCT_CD', 'LAW_CAT_CD'. You can
## override using the `.groups` argument.
top_crime_precinct <- top_crime_precinct %>%
  left_join(data %>% distinct(ADDR_PCT_CD, longitude, latitude, BORO_NM), by = "ADDR_PCT_CD")

Visualize Impact Map

ggmap(newyork_map) +
  geom_point(data = top_crime_precinct, aes(x = longitude, y = latitude, color = LAW_CAT_CD, shape = impact, size = (total_complaints/10))) +
  scale_color_manual(name = "Crime Category", values = c("FELONY" = "red", "MISDEMEANOR" = "orange", "VIOLATION" = "yellow"), labels = c("Felony", "Misdemeanor", "Violation")) +
  scale_shape_manual(name = "Impact Level", values = c("High Impact" = 15, "Medium Impact" = 17, "Low Impact" = 19)) +
  scale_size_continuous(name = "Total Complaints", guide = "none") +
  theme(legend.text = element_text(size = 12),  
        legend.title = element_text(size = 14), 
        legend.key.size = unit(2, "lines")) +   
  guides(color = guide_legend(override.aes = list(size = 5)),  
         shape = guide_legend(override.aes = list(size = 5))) +  
  labs(title = "Most Complaints in each Precinct by Type of Crime and Severity in New York City 2012 - 2022") +
  theme(plot.title = element_text(hjust = 0.5, size = 20))  # Center and enlarge the title

This map takes a different approach by focusing on the most common complaint per precinct, categorized by the severity of the crime. It shows the distribution of the most frequent complaint category across NYC precincts: in Midtown and Lower Manhattan, most complaints were medium-impact felonies, while low-impact misdemeanors were the leading category in Staten Island. Each precinct is represented by a single symbol characterizing its most frequent type of complaint, which helps in quickly identifying which precincts have higher levels of certain types of crimes and how the impact level of crimes varies across precincts.
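The "single symbol per precinct" reduction is a grouped top-1 selection. A minimal sketch on toy counts (slice_max is the current dplyr equivalent of the top_n(1, ...) call used above; the data here are invented):

```r
library(dplyr)

# Invented per-precinct complaint counts
counts <- tibble::tibble(
  precinct   = c(1, 1, 2, 2),
  category   = c("FELONY", "MISDEMEANOR", "FELONY", "VIOLATION"),
  complaints = c(40, 55, 70, 10)
)

# Keep each precinct's single most frequent category
top_per_precinct <- counts %>%
  group_by(precinct) %>%
  slice_max(complaints, n = 1) %>%
  ungroup()
top_per_precinct
```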

Table

top_five_crime_impact <- top_crime_precinct %>%
  arrange(desc(total_complaints)) %>%
  slice_head(n = 5) %>%
  select(BORO_NM, ADDR_PCT_CD, LAW_CAT_CD, impact, total_complaints) %>%
  rename(
    Borough = BORO_NM,
    Precinct = ADDR_PCT_CD,
    'Category of Crime' = LAW_CAT_CD,
    Impact = impact,
    'Number of Complaints' = total_complaints
  )

top_five_crime_impact
## # A tibble: 5 × 5
##   Borough   Precinct `Category of Crime` Impact        `Number of Complaints`
##   <chr>        <int> <chr>               <chr>                          <int>
## 1 BRONX           52 MISDEMEANOR         Low Impact                       143
## 2 QUEENS         114 FELONY              Medium Impact                    102
## 3 MANHATTAN       19 FELONY              Medium Impact                     95
## 4 BROOKLYN        67 MISDEMEANOR         Low Impact                        90
## 5 BROOKLYN        75 MISDEMEANOR         Low Impact                        89

While this study is comprehensive, we encountered several limitations that warrant mention. Connecting our datasets was challenging because each used a different neighborhood identifier: neighborhood name, zip code, precinct, or community board. Additionally, the lack of publicly available data on rental amenities limited the analysis we were able to conduct on rental prices in the city. Further research would include building a directory that connects neighborhood identifiers with greater ease, and gathering unit-specific amenity data. Finally, now that we have the data to determine what needs to be improved in the various neighborhoods, we would want to investigate what concrete actions could address the crime and noise issues these communities face.
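The directory suggested above could be as simple as a lookup table carrying every identifier for each area, joined dataset by dataset. A hypothetical sketch (all identifiers and values invented for illustration; a real crosswalk would come from NYC's published geography files):

```r
library(dplyr)

# Hypothetical crosswalk linking the identifiers used by the three datasets
crosswalk <- tibble::tibble(
  zip             = c("10001", "11215"),
  community_board = c("04 MANHATTAN", "06 BROOKLYN"),
  precinct        = c(10, 78)
)

rents  <- tibble::tibble(zip = c("10001", "11215"),
                         median_rent = c(4200, 3100))
crimes <- tibble::tibble(precinct = c(10, 78),
                         complaints = c(95, 60))

# Rent (keyed by zip) -> crosswalk -> crime (keyed by precinct), in two joins
merged <- rents %>%
  left_join(crosswalk, by = "zip") %>%
  left_join(crimes, by = "precinct")
merged
```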

Written by Group 6 (Stat Squad: Abhay Shah, David Skorodinsky, Javier Miguelez, Sarah Doctor, Tian Lan)